3 Logistic regression

Data

Task 2: Reading and describing the data


library(tidyverse); library(psych); library(knitr); library(kableExtra)
alc <- read.table("data/alc.csv", sep = ",", header = T)


The data is Student Performance Data Set from UCI Machine Learning Repository. It approaches student achievement in secondary education of two Portuguese schools. The data attributes include student grades, demographic, social and school related features) and it was collected by using school reports and questionnaires. Two datasets are provided regarding the performance in two distinct subjects: Mathematics (mat) and Portuguese language (por). Full description of the original data can be found here.

The two datasets have been merged using several variables as identifiers to combine individual students information: school, sex, age, address, family size, parents’ cohabitation status, parents’ education and job, reason to choose school, attendance to school nursery, and home internet access. If the student had answered to the same question on both questionnaires, the the rounded average was calculated. If the question was non-numeric, the answer on Mathematics performance dataset was used.

The R script about creating the merged dataset can be found here.


Variables

##  [1] X          school     sex        age        address    famsize   
##  [7] Pstatus    Medu       Fedu       Mjob       Fjob       reason    
## [13] nursery    internet   guardian   traveltime studytime  failures  
## [19] schoolsup  famsup     paid       activities higher     romantic  
## [25] famrel     freetime   goout      Dalc       Walc       health    
## [31] absences   G1         G2         G3         alc_use    high_use


Table 1 Information about variables used in analyses
Variables Information
sex ‘s sex (binary: ’F’ - female or ‘M’ - male)
Medu mother’s education (numeric: 0 none, 1 primary education (4th grade), 2 5th to 9th grade, 3 secondary education or 4 higher education)
failures number of past class failures (numeric: n if 1<=n<3, else 4)
absences number of school absences (numeric: from 0 to 93)
high_use high alcohol consumption (TRUE: the average of self-reported workday and weekend alcohol consumption greater than 2 on a scale 1 -very low - 5 very high, FALSE: the average 2 or lower)


Let’s take a glimpse of the structure and dimensions of the subset of data we are using:


# subsetting
sub <- c("sex", "Medu", "failures", "absences", "G3")
glimpse(alc[,sub])
## Observations: 382
## Variables: 5
## $ sex      <fct> F, F, F, F, F, M, M, F, M, M, F, F, M, M, M, F, F, F,...
## $ Medu     <int> 4, 1, 1, 4, 3, 4, 2, 4, 3, 3, 4, 2, 4, 4, 2, 4, 4, 3,...
## $ failures <int> 0, 0, 2, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,...
## $ absences <int> 5, 3, 8, 1, 2, 8, 0, 4, 0, 0, 1, 2, 1, 1, 0, 5, 8, 3,...
## $ G3       <int> 8, 8, 11, 14, 12, 14, 12, 10, 18, 14, 12, 12, 13, 12,...


As Medu is truly a categorical variable, not a numerical one, it will have to be recoded into a factor.


alc$Medu <- factor(alc$Medu)
levels(alc$Medu) <- c("None", "Primary", "5th to 9th", "Secondary", "Higher")








Hypotheses

Task 3: Choosing the variables and presenting hypotheses

The purpose of my analysis is to study the relationships between high/low alcohol consumption and students’ demographic, social, and school-realted characetristics.

I chose following variables to explain the students’ alcohol consumption: student’s sex, father’s education, class failures, and absentees.

  1. Student’s sex: Boys and young men usually consume more alcohol compared to girls or young women.
  2. Mother’s education: Parents’ education has been found to be a significant predictor of different social outcomes and well-being. Families with higher education require less support from sociaty in general. Differences in education level might also reflect differences in parents’ own behavior, values and parental support.
  3. Failure: Difficulties stack up and failures might lead to disaffection towards school, which in turn might lead to valuing other activities and social circles that accept or encourage alcohol consumption.
  4. Absences: Absences might be indications of disaffection and negative attitudes towards school, or other problems in life.








Exploring variables

Task 4: Exploring the chosen variables numerically and graphically


describe(alc[, c("sex", "Medu", "failures", "absences", "G3")]) %>% kable(digits = 3, caption = "<b>Table 2</b> Descriptives of chosen variables") %>% kable_styling(bootstrap_options = c("striped", "hover"))
Table 2 Descriptives of chosen variables
vars n mean sd median trimmed mad min max range skew kurtosis se
sex* 1 382 1.482 0.500 1 1.477 0.000 1 2 1 0.073 -2.000 0.026
Medu* 2 382 3.806 1.086 4 3.892 1.483 1 5 4 -0.384 -1.037 0.056
failures 3 382 0.202 0.583 0 0.033 0.000 0 3 3 3.034 8.689 0.030
absences 4 382 4.500 5.463 3 3.536 2.965 0 45 45 3.187 16.186 0.279
G3 5 382 11.458 3.310 12 11.631 2.965 0 18 18 -0.459 0.180 0.169


p1 <- ggplot(alc, aes(x = sex, fill = sex))
p1 + geom_bar() + theme(legend.position = "none") + ggtitle("Students' sex")+ ylab(" ") + xlab("Sex")

alc %>% count(sex) %>% kable(caption = "<b>Tale 3</b> Students' sex") %>% kable_styling(bootstrap_options = c("striped", "hover"), full_width = F, position = "left")  %>%  column_spec(2, width = "15em")
Tale 3 Students’ sex
sex n
F 198
M 184
p2 <- ggplot(alc, aes(x = high_use, fill = sex))
p2 + geom_bar() + theme(legend.position = "none") + facet_wrap(~sex) + 
  labs(title = "High alcohol consumption of female (left) and male (right) students",
       caption = "TRUE = high consumption, FALSE = low consumption") +
  ylab(" ") + xlab("High alcohol consumption")

alc %>%
  group_by(high_use, sex)%>%
  summarise(n=n())%>%
  spread(sex, n) %>%
  kable(caption = "<b>Table 3</b> High/low alcohol consumption by students' sex crosstabulated") %>%
  kable_styling(bootstrap_options = c("striped", "hover"))
Table 3 High/low alcohol consumption by students’ sex crosstabulated
high_use F M
FALSE 156 112
TRUE 42 72


p3 <- ggplot(alc, aes(x = Medu))
p3 + geom_bar(fill = "deepskyblue2") + theme_classic() + ggtitle("Mother's education") + ylab(" ") + xlab("Education level of mother")

alc %>%
  group_by(high_use, Medu)%>%
  summarise(n=n())%>%
  spread(Medu, n) %>%
  kable(caption = "<b>Table 4</b> High/low alcohol consumption by mother's education crosstabulated") %>%
  kable_styling(bootstrap_options = c("striped", "hover"))
Table 4 High/low alcohol consumption by mother’s education crosstabulated
high_use None Primary 5th to 9th Secondary Higher
FALSE 1 33 80 59 95
TRUE 2 18 18 36 40
p3 <- ggplot(alc, aes(x = failures))
p3 + geom_bar(fill = "deepskyblue2") + theme_classic() + ggtitle("Class failures") + ylab(" ") + xlab("Number of class failures")

alc %>%
  group_by(high_use, failures)%>%
  summarise(n=n())%>%
  spread(failures, n) %>%
  kable(caption = "<b>Table 5</b> High/low alcohol consumption by mother's education crosstabulated") %>%
  kable_styling(bootstrap_options = c("striped", "hover"))
Table 5 High/low alcohol consumption by mother’s education crosstabulated
high_use 0 1 2 3
FALSE 244 12 10 2
TRUE 90 12 9 3
p4 <- ggplot(alc, aes(x = absences))
p4 + geom_bar(fill = "deepskyblue2") + theme_classic() + ggtitle("Absences") + ylab(" ") + xlab("Number of absences")

p5 <- ggplot(alc, aes(x = high_use, y = absences, fill = sex)) 
p5 + geom_boxplot() + ggtitle("Boxplots of absences vs. high/low alcohol use by gender") + ylab("Number of absences") + xlab("High alcohol consumption")

alc %>% group_by(sex, high_use) %>% summarise(count = n(), mean_absences = mean(absences)) %>% kable(digits = 2, caption = "<b>Table 5</b> Mean number absences by high/low alcohol consumption and sex") %>%
  kable_styling(bootstrap_options = c("striped", "hover"))
Table 5 Mean number absences by high/low alcohol consumption and sex
sex high_use count mean_absences
F FALSE 156 4.22
F TRUE 42 6.79
M FALSE 112 2.98
M TRUE 72 6.12
p6 <- ggplot(alc, aes(x = G3))
p6 + geom_histogram(color = "white", fill = "deepskyblue2", bins=18) + theme_classic() + ggtitle("Grade") + ylab(" ") + xlab("Grade")

p7 <- ggplot(alc, aes(x = high_use, y = G3, fill = sex))
p7 + geom_boxplot() + ggtitle("Boxplots of grades vs. high/low alcohol use by gender") + ylab("Grade") + xlab("High alcohol consumption")

alc %>% group_by(sex, high_use) %>% summarise(count = n(), mean_grade = mean(G3)) %>% kable(digits = 3, caption = "<b>Table 6</b> Mean grade by high/low alcohol consumption and sex") %>%
  kable_styling(bootstrap_options = c("striped", "hover"))
Table 6 Mean grade by high/low alcohol consumption and sex
sex high_use count mean_grade
F FALSE 156 11.397
F TRUE 42 11.714
M FALSE 112 12.205
M TRUE 72 10.278








Logistic Regression Model

Task 5: Using logistic regression to statistically explore the relationships

# Fitting the logistic regression
mod1 <- glm(high_use ~ Medu + failures + absences + sex, data = alc, family = "binomial")

# Summary of the model
summary(mod1)
## 
## Call:
## glm(formula = high_use ~ Medu + failures + absences + sex, family = "binomial", 
##     data = alc)
## 
## Deviance Residuals: 
##     Min       1Q   Median       3Q      Max  
## -2.3466  -0.8419  -0.6057   1.0340   2.3030  
## 
## Coefficients:
##                Estimate Std. Error z value Pr(>|z|)    
## (Intercept)     0.07693    1.26355   0.061   0.9514    
## MeduPrimary    -1.67979    1.29798  -1.294   0.1956    
## Medu5th to 9th -2.65576    1.29394  -2.052   0.0401 *  
## MeduSecondary  -1.72156    1.28764  -1.337   0.1812    
## MeduHigher     -2.00750    1.28668  -1.560   0.1187    
## failures        0.43321    0.20127   2.152   0.0314 *  
## absences        0.09627    0.02427   3.967 7.28e-05 ***
## sexM            0.97945    0.24932   3.928 8.55e-05 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## (Dispersion parameter for binomial family taken to be 1)
## 
##     Null deviance: 465.68  on 381  degrees of freedom
## Residual deviance: 412.83  on 374  degrees of freedom
## AIC: 428.83
## 
## Number of Fisher Scoring iterations: 4


As Medu does not consistently predict high alcohol consumption well (the only significant coefficient being class “5th to 9th” [p =.040], I omitted the variable from the model. This also makes the model easier to interpret.


# Fittiing Model 2
mod2 <- glm(high_use ~ failures + absences + sex, data = alc, family = "binomial")

# Summary of Model 2
summary(mod2)
## 
## Call:
## glm(formula = high_use ~ failures + absences + sex, family = "binomial", 
##     data = alc)
## 
## Deviance Residuals: 
##     Min       1Q   Median       3Q      Max  
## -2.1855  -0.8371  -0.6000   1.1020   2.0209  
## 
## Coefficients:
##             Estimate Std. Error z value Pr(>|z|)    
## (Intercept) -1.90297    0.22626  -8.411  < 2e-16 ***
## failures     0.45082    0.18992   2.374 0.017611 *  
## absences     0.09322    0.02295   4.063 4.85e-05 ***
## sexM         0.94117    0.24200   3.889 0.000101 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## (Dispersion parameter for binomial family taken to be 1)
## 
##     Null deviance: 465.68  on 381  degrees of freedom
## Residual deviance: 424.40  on 378  degrees of freedom
## AIC: 432.4
## 
## Number of Fisher Scoring iterations: 4
# Computing odds ratios and confidence intervals
OR <- coef(mod2) %>% exp
CI <- confint(mod2) %>% exp
## Waiting for profiling to be done...
# Printing out the odds ratios with their confidence intervals
cbind(OR, CI) %>% kable(digits = 3, caption = "<b>Table 7</b> Odds ratios and their confidence intervals") %>%  kable_styling(bootstrap_options = c("striped", "hover"))
Table 7 Odds ratios and their confidence intervals
OR 2.5 % 97.5 %
(Intercept) 0.149 0.094 0.229
failures 1.570 1.083 2.295
absences 1.098 1.052 1.151
sexM 2.563 1.604 4.149

As we can see from the model summary, the intercept of high alcohol consumption is -1.90, which is more than 8 standard deviations (z value or the Wald’s Test value) away from 0 on the standard normal curve with a statistically significant p < .001. The slope coefficient of, for example, absences is 0.093. This means that for one point increase in absences the log of the odds of high alcohol consumption increases 0.09. The z values of coefficients of failures, absences, and sex ar positive and over 2 standard deviations away from 0 and are statistically significant with p < .05 for failures and p < .001 for absences and sex.

From the odds ratios we we can see that when the effect of the other predictor variables are taken into account…

  • The odds of high alcohol consumption increases about 8% to 130% with each class failure.
  • The odds of high alcohol consumption increases about 5 to 15 % with every absence.
  • The odds of male students to consume high amounts of alcohol is about one-and-a-half to to four times the odds of female students.

Conclusion: As expected, class failures, school absences, and student’s sex predict higher alcohol consumption. Male students are more likely be high alcohol consumers. Class failures and absentees also increase the probability of higher consumption. However, mother’s education doesn’t seem to predict alcohol use consistently.









Predictiven power

Task 6: Explore the predictive power of the model

# predict() the probability of high_use
probabilities <- predict(mod1, type = "response")

# add the predicted probabilities to 'alc'
alc <- mutate(alc, probability = probabilities)

# use the probabilities to make a prediction of high_use
alc <- mutate(alc, prediction = probability > 0.5)

# see the last ten original classes, predicted probabilities, and class predictions
select(alc, Medu, failures, absences, sex, high_use, probability, prediction) %>% tail(10)
##           Medu failures absences sex high_use probability prediction
## 373    Primary        1        0   M    FALSE  0.45259271      FALSE
## 374     Higher        1        7   M     TRUE  0.53891660       TRUE
## 375 5th to 9th        0        1   F    FALSE  0.07708996      FALSE
## 376     Higher        0        6   F    FALSE  0.20538904      FALSE
## 377 5th to 9th        1        2   F    FALSE  0.12421781      FALSE
## 378  Secondary        0        2   F    FALSE  0.18968092      FALSE
## 379    Primary        2        2   F    FALSE  0.36728009      FALSE
## 380    Primary        0        3   F    FALSE  0.21181023      FALSE
## 381  Secondary        0        4   M     TRUE  0.43043082      FALSE
## 382  Secondary        0        2   M     TRUE  0.38399298      FALSE
# tabulate the target variable versus the predictions
table(high_use = alc$high_use, prediction = alc$prediction)
##         prediction
## high_use FALSE TRUE
##    FALSE   257   11
##    TRUE     78   36
# initialize a plot of 'high_use' versus 'probability' in 'alc'
g <- ggplot(alc, aes(x = probability, y = high_use, col = prediction))

# define the geom as points and draw the plot
g + geom_point()

# tabulate the target variable versus the predictions
table(high_use = alc$high_use, prediction = alc$prediction) %>% prop.table() %>% addmargins()
##         prediction
## high_use      FALSE       TRUE        Sum
##    FALSE 0.67277487 0.02879581 0.70157068
##    TRUE  0.20418848 0.09424084 0.29842932
##    Sum   0.87696335 0.12303665 1.00000000
# define a loss function (mean prediction error)
loss_func <- function(class, prob) {
  n_wrong <- abs(class - prob) > 0.5
  mean(n_wrong)
}

# call loss_func to compute the average number of wrong predictions in the (training) data
loss_func(class = alc$high_use, prob = alc$probability)
## [1] 0.2329843


The average number of frong predictions in the data is 23% using student’s sex, absences, and class failures as predictors. This means that the prediction was right about three times out of four. This is significantly better than by just guessing (error rate of 50%). The model was especially accurate at predicting low alcohol consumption by preidcting correctly 257 times out of 268. However, the model predicted wrong most of the cases where the alcohol consumption was categorized high.